15 research outputs found

    Extending in Silico Protein Target Prediction Models to Include Functional Effects.

    In silico protein target deconvolution is frequently used for mechanism-of-action investigations; however, existing protocols usually do not predict compound functional effects, such as activation or inhibition, upon binding to their protein counterparts. This study is hence concerned with including functional effects in target prediction. To this end, we assimilated a bioactivity training set for 332 targets, comprising 817,239 active data points with unknown functional effect (binding data) and 20,761,260 inactive compounds, along with 226,045 activating and 1,032,439 inhibiting data points from functional screens. Chemical space analysis of the data first showed some separation between compound sets (binding and inhibiting compounds were more similar to each other than either binding and activating or activating and inhibiting compounds), providing a rationale for implementing functional prediction models. We employed three different architectures to predict functional response, ranging from simplistic random forest models ('Arch1') to cascaded models that use separate binding and functional effect classification steps ('Arch2' and 'Arch3'), differing in the way training sets were generated. Fivefold stratified cross-validation showed that cascading predictions provides superior precision and recall on an internal test set. We next prospectively validated the architectures using a temporal set of 153,467 in-house data points (after a 4-month interim from initial data extraction). Arch3 achieved the highest target-class-averaged precision and recall scores, 71% and 53%, which we attribute to the use of inactive background sets. Distance-based applicability domain (AD) analysis showed that Arch3 provides superior extrapolation into novel areas of chemical space; based on the results presented here, we therefore propose it as the most suitable architecture for the functional effect prediction of small molecules. Finally, we conclude that including functional effects could provide vital insight in future studies, to annotate cases of unanticipated functional changeover, as outlined by our CHRM1 case study. LM thanks the Biotechnology and Biological Sciences Research Council (BBSRC) (BB/K011804/1), and AstraZeneca, grant number RG75821.
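
The cascaded idea described in this abstract (a binding step followed by a functional-effect step) can be illustrated with a minimal sketch. This is not the authors' Arch2 or Arch3 implementation: the function name `cascade_predict`, the two model callables, and both 0.5 thresholds are hypothetical choices made only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical type: a model maps a feature vector to a probability in [0, 1].
Model = Callable[[Sequence[float]], float]


def cascade_predict(
    features: Sequence[float],
    binding_model: Model,
    functional_model: Model,
    binding_threshold: float = 0.5,     # assumed cut-off, not from the abstract
    activation_threshold: float = 0.5,  # assumed cut-off, not from the abstract
) -> str:
    """Two-step prediction: first binding, then functional effect."""
    # Step 1: decide whether the compound is predicted to bind at all.
    if binding_model(features) < binding_threshold:
        return "inactive"
    # Step 2: only predicted binders reach the functional-effect classifier.
    p_activating = functional_model(features)
    return "activating" if p_activating >= activation_threshold else "inhibiting"
```

In practice the two callables would wrap trained classifiers; here they are left abstract so that the control flow of the cascade is the only thing being shown.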

    Target prediction utilising negative bioactivity data covering large chemical space.

    BACKGROUND: In silico analyses are increasingly being used to support mode-of-action investigations; however, many such approaches do not utilise the large amounts of inactive data held in chemogenomic repositories. The objective of this work is to integrate such bioactivity data into the target prediction of orphan compounds, producing the probability of activity and inactivity for a range of targets. To this end, a novel human bioactivity data set was constructed through the assimilation of over 195 million bioactivity data points deposited in the ChEMBL and PubChem repositories, and the subsequent application of a sphere-exclusion selection algorithm to oversample presumed inactive compounds. RESULTS: A Bernoulli Naïve Bayes algorithm was trained using the data and evaluated using fivefold cross-validation, achieving a mean recall and precision of 67.7% and 63.8% for active compounds and 99.6% and 99.7% for inactive compounds, respectively. We show that the performance of the models is considerably influenced by the underlying intraclass training similarity, the size of a given class of compounds, and the degree of additional oversampling. The method was also validated using compounds extracted from WOMBAT, producing average precision-recall AUC and BEDROC scores of 0.56 and 0.85, respectively. Inactive data points used for this test are based on presumed inactivity, producing an approximated indication of the true extrapolative ability of the models. A distance-based applicability domain analysis was also conducted, indicating that an average Tanimoto Coefficient distance of 0.3 or greater between a test and training set can be used to give a global measure of confidence in model predictions. In a final comparison, a method trained solely on active data from ChEMBL achieved precision-recall AUC and BEDROC scores of 0.45 and 0.76.
CONCLUSIONS: The inclusion of inactive data for model training produces models with superior AUC and improved early recognition capabilities, although the results from internal and external validation show differing performance across the breadth of models. The realised target prediction protocol is available at https://github.com/lhm30/PIDGIN. Graphical abstract: The inclusion of large-scale negative training data for in silico target prediction improves the precision-recall AUC and BEDROC scores for target models. The authors thank Krishna C. Bulusu for proofreading the manuscript. LHM would like to thank BBSRC and AstraZeneca for their funding. GD thanks EPSRC and Eli Lilly for funding. This is the final version of the article. It first appeared from Springer via http://dx.doi.org/10.1186/s13321-015-0098-
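
The distance-based applicability domain analysis mentioned in this abstract relies on Tanimoto distances between test and training compounds. The sketch below shows one plausible way to compute such a distance, assuming fingerprints are represented as sets of on-bit indices. The function names and the choice of averaging nearest-neighbour distances are assumptions for illustration, not the PIDGIN implementation.

```python
from typing import Iterable, Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / union


def nearest_training_distance(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Tanimoto distance (1 - similarity) from a query to its closest training compound."""
    return min(1.0 - tanimoto_similarity(query, fp) for fp in training)


def mean_nearest_distance(test: Iterable[Set[int]], training: Iterable[Set[int]]) -> float:
    """Average nearest-neighbour distance of a test set to a training set."""
    training_list = list(training)
    distances = [nearest_training_distance(fp, training_list) for fp in test]
    return sum(distances) / len(distances)
```

The resulting average could then be compared with the 0.3 value quoted in the abstract; how exactly that value maps to confidence should be taken from the paper itself.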

    Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty.

    Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance. Ideally, they should be factored into modelling and output predictions, for example via the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between the aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state-of-the-art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way in the original RF algorithm. For example, in cases when σ ranged between 0.4 and 0.6 log units and when ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore considered inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.

    Bridging of anions by hydrogen bonds in nest motifs and its significance for Schellman loops and other larger motifs within proteins

    The nest is a protein motif of three consecutive amino acid residues with dihedral angles 1,2-αRαL (RL nests) or 1,2-αLαR (LR nests). Many nests form a depression in which an anion or δ-negative acceptor atom is bound by hydrogen bonds from the main chain NH groups. We have determined the extent and nature of this bridging in a database of protein structures using a computer program written for the purpose. Acceptor anions are bound by a pair of bridging hydrogen bonds in 40% of RL nests and 20% of LR nests. Two thirds of the bridges are between the NH groups at Positions 1 and 3 of the motif (N1N3-bridging), which confers a concavity to the nest; one third are of the N2N3 type, which does not. In bridged LR nests N2N3-bridging predominates (14% N1N3: 75% N2N3), whereas in bridged RL nests the reverse is true (69% N1N3: 25% N2N3). Most bridged nests occur within larger motifs: 45% in (hexapeptide) Schellman loops with an additional 4→0 hydrogen bond (N1N3), 11% in Schellman loops with an additional 5→1 hydrogen bond (N2N3), 12% in a composite structure including a type 1 β-bulge loop and an asx- or ST-motif (N1N3), remarkably homologous to the N1N3-bridged Schellman loop, and 3% in a composite structure including a type 2 β-bulge loop and an asx-motif (N2N3). A third hydrogen bond is a previously unrecognized feature of Schellman loops as those lacking bridged nests have an additional 4→0 hydrogen bond.

    Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

    In the context of small molecule property prediction, experimental errors are usually a neglected aspect of model generation. The main caveat of binary classification approaches is that they weight minority cases close to the threshold boundary equivalently when distinguishing between activity classes. For example, pXC50 activity values of 5.1 and 4.9 are treated as equally important in contributing to opposing activity classes (e.g., with a classification threshold of 5), even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice, and it is therefore important both to evaluate the presence of experimental error in databases and to apply methodologies that account for variability in experiments and uncertainty near the decision boundary. In order to improve upon this, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF comprises a modification to the long-established Random Forest (RF) that takes into account uncertainties in the assigned classes (i.e., activity labels). This enables representing the activity in a framework in between the classification and regression architectures, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Conversely, compared to a classical regression framework, one can utilize all data (even delimited/operand/censored data far from a cut-off) while also taking into account the granularity around the cut-off. The algorithm was applied toward ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way in the original RF algorithm. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold. With this approach, we present, to our knowledge for the first time, an application of probabilistic modelling of activity data for target prediction using the PRF algorithm.
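
The pXC50 example in this abstract (5.1 versus 4.9 around a threshold of 5) can be made concrete with a small sketch of a soft activity label. It assumes, for illustration only, that the measurement error is normally distributed with standard deviation σ, so that the probability of a compound being truly active is the normal cumulative distribution function evaluated at the measured value. The function name and the default σ of 0.5 log units are hypothetical; the abstract does not specify how labels are derived.

```python
import math


def probability_active(pxc50: float, threshold: float = 5.0, sigma: float = 0.5) -> float:
    """Probability that the true activity exceeds the threshold,
    assuming normally distributed measurement error (an assumption)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (pxc50 - threshold) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Measurements just either side of the threshold get soft labels near 0.5
# instead of hard 1/0 labels.
p_above = probability_active(5.1)  # about 0.58
p_below = probability_active(4.9)  # about 0.42
```

Under this assumption the two measurements receive labels that differ only slightly, which is the behaviour the abstract argues a hard binary threshold cannot express.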

    Systematic Analysis of Protein Targets Associated with Adverse Events of Drugs from Clinical Trials and Postmarketing Reports.

    Adverse drug reactions (ADRs) are undesired effects of medicines that can harm patients and are a significant source of attrition in drug development. ADRs are anticipated by routinely screening drugs against secondary pharmacology protein panels. However, there is still a lack of quantitative information on the links between these off-target proteins and the reporting of ADRs in humans. Here, we present a systematic analysis of associations between measured and predicted in vitro bioactivities of drugs and adverse events (AEs) in humans from two sources of data: the Side Effect Resource, derived from clinical trials, and the Food and Drug Administration Adverse Event Reporting System, derived from postmarketing surveillance. The ratio of a drug's therapeutic unbound plasma concentration over the drug's in vitro potency against a given protein was used to select proteins most likely to be relevant to in vivo effects. In examining individual target bioactivities as predictors of AEs, we found a trade-off between the positive predictive value and the fraction of drugs with AEs that can be detected. However, considering sets of multiple targets for the same AE can help identify a greater fraction of AE-associated drugs. Of the 45 targets with statistically significant associations to AEs, 30 are included on existing safety target panels. The remaining 15 targets include 9 carbonic anhydrases, of which CA5B is significantly associated with cholestatic jaundice. We include the full quantitative data on associations between measured and predicted in vitro bioactivities and AEs in humans in this work, which can be used to make a more informed selection of safety profiling targets
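
The ratio described in this abstract, a drug's therapeutic unbound plasma concentration over its in vitro potency against a protein, is simple arithmetic. The sketch below assumes potency is supplied as a pXC50 value (negative log10 of a molar concentration) and that the plasma concentration is given in molar units; the function names and example numbers are hypothetical, and any cut-off applied to the ratio would have to come from the paper.

```python
def potency_molar(pxc50: float) -> float:
    """Convert a pXC50 value to a molar concentration (e.g. 6.0 -> 1e-6 M)."""
    return 10.0 ** (-pxc50)


def exposure_ratio(unbound_plasma_molar: float, pxc50: float) -> float:
    """Unbound plasma concentration divided by in vitro potency.

    Values above 1 mean the unbound concentration exceeds the potency value.
    """
    if unbound_plasma_molar < 0:
        raise ValueError("concentration must be non-negative")
    return unbound_plasma_molar / potency_molar(pxc50)


# Hypothetical example: 0.1 micromolar unbound drug against a 1 micromolar potency.
ratio = exposure_ratio(1e-7, 6.0)  # about 0.1
```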

    Multitask Bioactivity Predictions Using Structural Chemical and Cell Morphology Information

    Understanding the Mechanism-of-Action (MoA) of compounds and predicting potential drug targets play an important role in small-molecule drug discovery. The aim of this work was to compare chemical and cell morphology information for bioactivity prediction. The comparison was performed using bioactivity data from the ExCAPE database, image data from the Cell Painting data set (the largest publicly available data set of cell images, with approximately 30,000 compound perturbations) and Extended Connectivity Fingerprints (ECFPs), using the multitask Bayesian Matrix Factorisation (BMF) approach Macau. We found that BMF Macau and Random Forest (RF) performance was similar overall when ECFP fingerprints were used as compound descriptors. However, BMF Macau outperformed RF in 155 out of 224 target classes (69.20%) when image data was used as compound information. Using BMF Macau, 100 (about 45%) and 90 (about 40%) of the 224 targets were predicted with high predictive performance (AUC > 0.8) with ECFP data and image data as side information, respectively. Some targets were better predicted with image data as side information, such as β-catenin, and others with fingerprint-based side information, such as proteins belonging to the G-Protein Coupled Receptor 1 family, which could be rationalized from the underlying data distributions in each descriptor domain. In conclusion, both cell morphology changes and structural chemical information contain information about compound bioactivity; this information is partially complementary and can hence contribute to in silico mechanism-of-action analysis.